Social Media: Friend or Foe of Natural Language Processing?

Author

  • Timothy Baldwin
Abstract

In this talk, I will outline some of the many challenges and opportunities that social media offer for natural language processing (NLP). I will present an analysis of how pre-processing can be used to make social media data more amenable to NLP, and review a selection of tasks which attempt to harness the considerable potential of different social media services.

There is no question that social media are fantastically popular and varied in form, ranging from user forums, to microblogs such as Twitter, to social networking sites such as Facebook, and that much of the content they host is in the form of natural language. This would suggest a myriad of opportunities for NLP, and yet much of the applied research on social media which uses language data is based on superficial analysis, often in the form of simple keyword search. This raises several questions. Are NLP methods not suited to social media analysis? Conversely, is social media data too challenging for modern-day NLP? Alternatively, are simple term search-based methods sufficient for social media analysis, i.e. is NLP overkill for social media? In exploring these questions, I attempt to answer the overarching question of whether social media data is the friend or foe of NLP.

I approach the question first from the perspective of what challenges social media language poses for NLP. The most immediate answer is the infamously free-form nature of language in social media, encompassing spelling inconsistencies, the free-form adoption of new terms, and regular violations of English grammar norms. Unsurprisingly, when NLP tools are applied directly to social media data, the results tend to be far worse than on data sets such as the Wall Street Journal component of the Penn Treebank. However, there have been recent successes in adapting parsers and POS taggers to social media data (Foster et al., 2011; Gimpel et al., 2011). Additionally, lexical normalisation and other preprocessing strategies have been shown to enhance the performance of NLP tools over social media data (Lui and Baldwin, 2012; Han et al., to appear). A further difficulty is that social media posts tend to be short and their content highly varied, meaning it is difficult to adapt a tool to the domain, or to harness textual context to disambiguate the content. There is also the engineering challenge of real-time processing of the text stream, as much of NLP research is carried out offline with only secondary concern for throughput. As such, we might conclude that social media data is a foe of NLP, in that it challenges traditional assumptions made in NLP research about the nature of the target text and the requirements for real-time responsiveness.

However, if we look beyond the immediate text content of social media, we quickly realise that there are various non-textual data sources that can be used to enhance the robustness and accuracy of NLP models, in a way which is not possible with static text corpora. For example, simple information on the author of a post can be used to develop author-adapted models based on the previous posts of the same individual (at least for users who post sufficiently large volumes of data). Links in the post can be used to disambiguate the textual content of the post, whether in the form of URLs and the content contained in the target document(s), hashtags and the content of other similarly-tagged posts, thread-

Similar resources

Digital Libraries in a Clinical Setting: Friend or Foe?

Clinical requirements for quick accessibility to reputable, up-to-date information have increased the importance of web-accessible digital libraries for this user community. To understand the social and organisational impacts of ward-accessible digital libraries (DLs) for clinicians, we conducted a study of clinicians’ perceptions of electronic information resources within a large London based ...

Friend or Foe? a Natural Experiment of the Prisoner’s Dilemma

This study examines data drawn from the game show Friend or Foe? which is similar to the classic prisoner’s dilemma tale: partnerships are endogenously determined, and players work together to earn money, after which they play a one-shot prisoner’s dilemma game over large stakes: varying from $200 to (potentially) more than $22,000. The data reveal several interesting insights; perhaps most pro...

Apocrine Metaplasia in Intraductal Papilloma with Foci of DCIS: A Friend or Foe?

Malignant papillary neoplasms of the breast comprise a number of microscopically distinct lesions, where apocrine metaplasia is commonly found in papillomas compared to other papillary lesions including papillary carcinomas. However, association of apocrine metaplasia in papilloma with Ductal Carcinoma in Situ (DCIS) is not very well defined. The lesions with apocrine metaplasia are not only di...

Using Generalized Language Model for Question Matching

Question answering is one of the popular services on the World Wide Web. The main goal of these services is to find the best answer to a user's input question as quickly as possible. To achieve this aim, most of them use new techniques for question matching. There are many question answering services on the Persian web, so it seems that developing a question matching m...

Publication date: 2012